Arrow IPC Array -> Bytes codec #3613
src/zarr/core/metadata/v3.py
if isinstance(dtype, VariableLengthUTF8) and codec_class_name not in (
    "VLenUTF8Codec",
    "ArrowIPCCodec",
):  # type: ignore[unreachable]
This change allows us to use either the vlen-utf8 or the arrow-ipc codec to encode variable-length strings.
Flagging that this sort of logic for mapping codec/dtype compatibility feels quite brittle and non-scalable, but I don't have a better proposal in mind.
I feel the same way! We might need a dtype x codec compatibility matrix; not sure whether it should track compatibility or incompatibility.
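A minimal sketch of what such a compatibility table could look like. The dtype and codec names below mirror the ones in the diff, but the table itself, its contents, and the `check_compatible` helper are hypothetical, not part of the PR:

```python
# Hypothetical dtype x codec compatibility table. Keys are dtype class names;
# values are the codec classes known to handle that dtype. Dtypes absent from
# the table place no restriction on codecs.
COMPATIBLE_CODECS: dict[str, frozenset[str]] = {
    "VariableLengthUTF8": frozenset({"VLenUTF8Codec", "ArrowIPCCodec"}),
    "VariableLengthBytes": frozenset({"VLenBytesCodec", "ArrowIPCCodec"}),
}


def check_compatible(dtype_name: str, codec_class_name: str) -> bool:
    """Return True if the codec is known-compatible with the dtype."""
    allowed = COMPATIBLE_CODECS.get(dtype_name)
    return allowed is None or codec_class_name in allowed


assert check_compatible("VariableLengthUTF8", "ArrowIPCCodec")
assert not check_compatible("VariableLengthUTF8", "BytesCodec")
```

Tracking compatibility (an allow-list, as above) fails closed when a new codec appears; tracking incompatibility would fail open, which is probably the wrong default here.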
@d-v-b - resolving the typing errors here is beyond my ability. Would appreciate your help. 🙏

CI is passing via 96273a8
# Note: we only expect a single batch per chunk
record_batch = record_batch_reader.read_next_batch()
array = record_batch.column(self.column_name)
numpy_array = array.to_numpy()
Very happy to see this happening :)
I would be very curious about the behavior of non-standard types here. What does something like a geometry dtype (which isn't in pyarrow) or a DictionaryArray (which is in core Arrow but has an implicit masking of sorts) do here? I can't deduce the exact behavior from the pyarrow docs, to be honest.
Would it make sense to have a custom buffer class, similar to what @keewis is doing for sparse (I think)?
Implementation of the arrow-ipc Array -> Bytes codec proposed in zarr-developers/zarr-extensions#41.

TODO:

- docs/user-guide/*.md
- changes/